gh-142183: Change data stack to use a resizable array#148681
dpdani wants to merge 5 commits into python:main
Conversation
|
@pablogsal can you take a look this week? 🙏 |
I can try but I have some other PRs first in my review queue :( |
Thanks for working on this. I am worried about a couple of consequences that I think we should account for before continuing with this:
A minor concern is that this seems to change Python stack memory from “roughly current depth” usage to “high-water mark for the lifetime of the thread”. In the old chunked implementation, deep recursion allocated additional 16 KiB chunks and then released most of them while unwinding. In this version, resize_stack() keeps previous stack chunks linked from stack_chunk_list, and _PyThreadState_PopFrame() only moves stack_top; the chunks are not freed until the thread state is deleted.
This doesn't seem to be a lot so I am not too worried.
My bigger concern is the profiler/debugger consequences. The old stack chunk layout allowed _remote_debugging/external unwinders to bulk-copy stack chunks cheaply. With this change, the active frame chain can span older chunks while only the newest chunk is copied in the new _remote_debugging path, so older frames fall back to individual remote reads. For a 1000-frame stack I measured:
- old no-cache unwinding: 4 memory reads, ~1.2 KiB read
- new no-cache unwinding: 966 memory reads, ~85.8 KiB read
Tachyon is probably mostly insulated because it uses frame caching, but first samples, cache-disabled paths, fallback paths, and external tools still care. Other profilers, such as Austin, currently hard-code the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.
Given this and the potential gains I do not find the tradeoff very convincing....
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
chunk_addr = GET_MEMBER(uintptr_t, chunks[count].local_copy, offsetof(_PyStackChunk, previous));
count++;
// Process this chunk
if (process_single_stack_chunk(unwinder, chunk_addr, &chunks[count]) < 0) {
Unless I’m missing something, stopping after a single chunk here looks like a large perf regression for the profiler. The runtime still has a linked chunk chain via stack_chunk_list -> previous, but this now only copies the newest chunk. If the active frame chain spans older chunks, find_frame_in_chunks() misses those frames and we fall back to parse_frame_object(), which does one remote memory read per frame.
|
Unfortunately, the more I think about it the less I like it: this model is harder to reason about than the previous linked-list model, and I think that matters because it creates a very easy footgun: the current chunk looks like the current stack backing store, but it is not. Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset. So any code that treats the current chunk as the sole backing store for live frames will be wrong.

The risk is not just to external profilers. This makes future runtime/debugger code more fragile, because pointer validity and frame ownership now depend on searching the whole chunk chain, not on checking the current chunk. |
|
@pablogsal
There seems to be a misunderstanding here. No two frames will have the same offset.
The stack is a linked list of frames, not chunks, and some frames (generator and coroutine frames) aren't in chunks at all, so tools already need to handle pointers outside of the current chunk. Overall, I don't see how this really changes anything for an out-of-process profiler: copy all the chunks, then traverse the stack. Also, note that the lower, unused part of the current chunk in a stack with multiple chunks will be untouched, so a profiler should be able to detect when it needs to cross to another chunk.
That is the trade-off that tools and libraries make: either they use stable APIs/ABIs, or they probe CPython internals. If they do the latter, they will need updating every release. @P403n1x87 would this be a problem for you? |
The issue I was pointing at is that the current code copies only the newest chunk. The fix here is to restore eager copying of the full chunk chain. |
Also, you’re right that “same logical offset” was poor wording. I meant that the new chunk starts allocating at the old stack depth, leaving the lower part of the new chunk unused; not that two frames have the same offset. |
|
Here is a repro: In main we can see: with this PR: Notice |
Yes, that's my bad. I made the smallest changes possible to the remote debugging module to get the PR working, but didn't look into further improvements. Maybe that can be done in a follow-up PR by people more knowledgeable about the module? Or would you consider that a blocker? |
This is a blocker. This PR adds a regression and that's not acceptable. |
Force-pushed from 8115b28 to 239e2eb
|
@dpdani I pushed a fix for the concrete regression discussed above. I also added a regression test to make sure deep stacks are resolved from copied chunks rather than falling back to parsing frames individually from remote memory.
|
That said, I still think this solution is very confusing and too complex. It is harder to reason about where frames live, which chunks are relevant, and what invariants the implementation can rely on. I am worried this makes future changes easier to get subtly wrong. |
Force-pushed from 239e2eb to 99e9e44
|
I investigated an alternative implementation in #149097: instead of replacing the stack with the resizable-array model, it keeps the current chunked-stack invariants and extends the existing one-chunk cache into a small bounded per-thread cache. I think this is a better direction because it fixes the allocator-thrashing issue without changing where live frames can reside and without changing the chunk layout that external tools probe.

The cache is intentionally bounded to a small number of chunks per thread. I checked the original repro and it no longer shows per-branch mmap/munmap churn. |
|
To fix the really bad behavior when we are constantly crossing the boundary between chunks, a single cached chunk is sufficient. See #145828. However, calls that cross chunk boundaries will still prevent specialization and JIT compilation, as the code keeps having to hit the slow path of not having enough space in the current chunk. Also, this approach allows us to use smaller initial chunks, as we don't need to worry about the cost of crossing chunk boundaries.
I don't think so, I think it is fairly elegant. But it is a bit subtle. |
|
@pablogsal would it help if the lower, unused part of the chunk were zeroed-out? |
Yes I think with more docs we can make this much more palatable. I want to say that I may be in the minority here so it would be great to get a 2nd opinion from @Fidget-Spinner regarding the complexity argument.
This is how I understand the tradeoff: The benefit of this approach over the chunk-cache approach is that it addresses a different problem. A cached chunk fixes the allocator-thrashing case, but a hot call near a chunk boundary can still keep failing The part that still makes me prefer the cache approach is the change in stack/profiler invariants. With the cache approach, inactive chunks are detached, and the active On zeroing the lower unused part: I think it could help as a defensive/debugging aid, but I would not rely on it as the correctness mechanism. A profiler/unwinder should still follow It may still be worth doing if we want the unused prefix to be deterministic and make heuristic scans fail cleanly. The main caveat is cost: explicit zeroing can touch pages that would otherwise remain lazily zeroed by the OS, so I would want to avoid zeroing large regions unless we measure that it is negligible. Perhaps we can alleviate the fact that profilers will need to copy "old zeroed chunks" byt keeping a pointer to the start of the "new" chunks? I can try to profile a bit....maybe I am overreacting to this and the difference is negligible. Will investigate |
const intptr_t offset = ptr - start;
const intptr_t usable_size = (intptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
return offset >= 0 && offset < usable_size && start + offset == ptr;
Suggested change:
-const intptr_t offset = ptr - start;
-const intptr_t usable_size = (intptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
-return offset >= 0 && offset < usable_size && start + offset == ptr;
+const uintptr_t usable_size = (uintptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
+return ptr >= start && (uintptr_t)(ptr - start) < usable_size;
As I have learned while working on this PR, those kinds of pointer comparisons are undefined behaviour according to the C standard, because they do not belong to the same object. See Section 6.5.8, paragraph 5 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf).
So ptr >= start is undefined, but computing where the pointer would have been in the new allocation and checking for equality is ok.
As with any change of this nature, it is indeed a problem, but as you point out, it is something that maintainers of this kind of tool have to put up with anyway. For Austin specifically, the crucial aspect is being able to quickly work out all the memory chunks with frame data that need to be copied eagerly, to ensure as much data coherence as possible (Austin does not pause threads when unwinding the Python stack). Once that data has been moved over, we can work out what to do with it. |
This PR changes the implementation of the Python stack to use a resizable array. This avoids the problem of calls that frequently cause the datastack_top (now called stack_top) pointer to switch between allocations.

After resizing, previous array allocations are not immediately freed, because various bits around the VM may still point into them; they are instead freed along with the tstate.
During resizing, the previous contents of the stack are not copied into the new allocation; the memory of the previous allocation remains in use. As frames are subsequently popped and pushed, new frames will always reside on the new stack allocation.
Overall it results in a ±1% performance change (within the noise range), but it avoids degenerate cases for any number of frames. I am also told it would allow further optimizations in the JIT.